Inroduction:

First loading all possible libraries that I might use.

Notice that the parameter “echo” was set to FALSE for this code chunk. This prevents the code from displaying in the knitted HTML output. You should set echo=FALSE for all code chunks in your file, unless it makes sense for your report to show the code that generated a particular plot.

The other parameters for “message” and “warning” should also be set to FALSE for other code chunks once you have verified that each plot comes out as you want it to. This will clean up the flow of your report.

Importing the data

For the purpose of this project I have chosen to work with Prosper Loan Data set. I hope to find interesting insight in this domain(loan/credit system) for example who are the suitable customers eligible for a loan, has there been any fraud so cases like these. So to get started first we load the dataset.

## [1] 113937     81

So after loading the data we find that there are 113937 rows/observations and 81 variables/columns.

Now there is a loan_data_dict file which contains the information about each variable. For the purpose of this project I am choosing 36 variables which I think is important for exploration purposes. So next step is to subset the data

##  [1] "ListingKey"                         
##  [2] "ListingNumber"                      
##  [3] "ListingCreationDate"                
##  [4] "CreditGrade"                        
##  [5] "Term"                               
##  [6] "LoanStatus"                         
##  [7] "ClosedDate"                         
##  [8] "BorrowerAPR"                        
##  [9] "BorrowerRate"                       
## [10] "LenderYield"                        
## [11] "EstimatedEffectiveYield"            
## [12] "EstimatedLoss"                      
## [13] "EstimatedReturn"                    
## [14] "ProsperRating (numeric)"            
## [15] "ProsperRating (Alpha)"              
## [16] "ProsperScore"                       
## [17] "ListingCategory (numeric)"          
## [18] "BorrowerState"                      
## [19] "Occupation"                         
## [20] "EmploymentStatus"                   
## [21] "EmploymentStatusDuration"           
## [22] "IsBorrowerHomeowner"                
## [23] "CurrentlyInGroup"                   
## [24] "GroupKey"                           
## [25] "DateCreditPulled"                   
## [26] "CreditScoreRangeLower"              
## [27] "CreditScoreRangeUpper"              
## [28] "FirstRecordedCreditLine"            
## [29] "CurrentCreditLines"                 
## [30] "OpenCreditLines"                    
## [31] "TotalCreditLinespast7years"         
## [32] "OpenRevolvingAccounts"              
## [33] "OpenRevolvingMonthlyPayment"        
## [34] "InquiriesLast6Months"               
## [35] "TotalInquiries"                     
## [36] "CurrentDelinquencies"               
## [37] "AmountDelinquent"                   
## [38] "DelinquenciesLast7Years"            
## [39] "PublicRecordsLast10Years"           
## [40] "PublicRecordsLast12Months"          
## [41] "RevolvingCreditBalance"             
## [42] "BankcardUtilization"                
## [43] "AvailableBankcardCredit"            
## [44] "TotalTrades"                        
## [45] "TradesNeverDelinquent (percentage)" 
## [46] "TradesOpenedLast6Months"            
## [47] "DebtToIncomeRatio"                  
## [48] "IncomeRange"                        
## [49] "IncomeVerifiable"                   
## [50] "StatedMonthlyIncome"                
## [51] "LoanKey"                            
## [52] "TotalProsperLoans"                  
## [53] "TotalProsperPaymentsBilled"         
## [54] "OnTimeProsperPayments"              
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"    
## [57] "ProsperPrincipalBorrowed"           
## [58] "ProsperPrincipalOutstanding"        
## [59] "ScorexChangeAtTimeOfListing"        
## [60] "LoanCurrentDaysDelinquent"          
## [61] "LoanFirstDefaultedCycleNumber"      
## [62] "LoanMonthsSinceOrigination"         
## [63] "LoanNumber"                         
## [64] "LoanOriginalAmount"                 
## [65] "LoanOriginationDate"                
## [66] "LoanOriginationQuarter"             
## [67] "MemberKey"                          
## [68] "MonthlyLoanPayment"                 
## [69] "LP_CustomerPayments"                
## [70] "LP_CustomerPrincipalPayments"       
## [71] "LP_InterestandFees"                 
## [72] "LP_ServiceFees"                     
## [73] "LP_CollectionFees"                  
## [74] "LP_GrossPrincipalLoss"              
## [75] "LP_NetPrincipalLoss"                
## [76] "LP_NonPrincipalRecoverypayments"    
## [77] "PercentFunded"                      
## [78] "Recommendations"                    
## [79] "InvestmentFromFriendsCount"         
## [80] "InvestmentFromFriendsAmount"        
## [81] "Investors"
# Number of Unique Records
distinct_loan_records <- length(unique(prosperLoans.sub$ListingNumber))
distinct_loan_records
## [1] 113066
# After subsetting the data, I wanted to know the unique number of records. I took ListingNumber as the field to check unique customers as it was a parameter which identified a loan borrower. But surprisingly I found that there are 11366 distinct records for a user which means that there may be duplicate/updated data for some users in our data set, Since the total number of rows are 113937.

# Number of Duplicate Records
duplicate_loan_records <- length(unique(prosperLoans.sub$ListingNumber
[duplicated(prosperLoans.sub$ListingNumber)])) 

#Saving the duplicate records in a new variable
duplicate_loans <- unique(prosperLoans.sub$ListingNumber
[duplicated(prosperLoans.sub$ListingNumber)])

duplicate_loans
##   [1] 1023355 1055932 1097584  927246  920397  894748 1066466  875616
##   [9]  969821  970570 1047993 1138422 1100572 1057901 1101199  898052
##  [17]  890514 1069928 1063352  900913 1116810  877625 1114591  925803
##  [25] 1027003 1111486 1138731 1077775  898757  990994 1044353  951186
##  [33] 1090596  926217 1116297  953329 1025631  877117  985290  919724
##  [41]  958819 1097279  945736 1059427 1015904 1202163  928642 1102061
##  [49] 1110415 1050418 1031431 1184734  891025  926813 1035457  906320
##  [57]  989195 1116724 1149692 1071533 1017921  940961 1119836  889018
##  [65] 1160473 1177596 1176594 1055942  997199  966176  902051 1053715
##  [73] 1134330 1031508  875651 1118868  908353  960603  930618  956743
##  [81] 1233732 1025541  918278  919623  844220  877719 1089145 1087743
##  [89] 1024510 1017753  936442  882888  901662  946929 1204344  974533
##  [97] 1200566 1055661  998257  879829  941296  944122 1190614 1190590
## [105]  931467  701942  876423 1166576 1131625 1136886  961598 1082906
## [113] 1092437 1052481 1167913 1138904 1037880 1237178 1189349  864114
## [121]  875249 1098094 1114520 1050318 1096149  932440 1151968  902157
## [129]  988904 1095113 1076428 1039105 1113064 1093115  880578  901650
## [137] 1021236 1080955 1079195  964692 1030511  855637  958929  990936
## [145]  875737 1148765 1030162 1086421 1044620 1124382 1014086 1021622
## [153] 1018488 1031370  879606 1123619 1102731  968180 1026732  857826
## [161] 1054113 1115327 1158389 1010477 1155587 1045723 1074971 1148799
## [169] 1121686  895686 1026555  990568 1143894 1165181  950042  990877
## [177] 1134181 1138460 1131123 1112691 1177424 1199490 1120459 1094962
## [185] 1028087  923713 1055568  949426 1056749 1151314  899615  956166
## [193] 1088191  908557 1094490  970565  915278 1170591  998232 1067478
## [201] 1103195 1054672  893227 1126419  989749 1119198 1136469  968319
## [209] 1125972 1154037  893464  812537 1087994 1108543  908028  914025
## [217] 1035191 1088101  930295  932593  851219  926425 1124568 1020751
## [225]  853665 1119803  987075 1121952  981333 1056941 1109263 1155979
## [233]  953464  892845  948269 1099457 1195343 1018930  970538 1104872
## [241] 1054732 1061584  960299 1154254  930842 1064438  986199  899140
## [249] 1087229  954927 1230428  990440 1122473  925162  905592  974745
## [257]  899828 1102482  836614 1169996  961245 1235007 1095372 1046027
## [265]  924066 1135450  931227 1107213  996510 1193829  925931 1137605
## [273] 1091297 1090653  924615  940142 1219068 1146677  930235  924637
## [281] 1108628 1072695 1018279 1080937 1109558 1026097 1212125 1006866
## [289] 1035245  905480 1015023 1080130  971971 1101080 1063589  972477
## [297] 1055627  881558  895558 1014296 1156309 1105838 1114920  959753
## [305] 1105397 1193025  894308 1153930 1133065 1108921 1005992  908476
## [313] 1124731  843648 1104667 1216695 1187057  866202  885990  924628
## [321]  887810  875666 1120935  911586  955360 1087132  895459 1075237
## [329] 1083058  950607 1105211 1091779 1226289 1188772  984135 1012949
## [337] 1038363 1198515  985955 1171593 1152960  908000  918514 1101815
## [345]  978521 1066485  950953  899188 1055937  941065 1096697  938171
## [353] 1129082 1087480 1093077 1060452 1190639  996453  951647 1189662
## [361]  917286  861876 1035339 1127013  902703 1177661 1008145 1142914
## [369]  929177  890186 1105533 1079945  889172 1052711 1134173  941636
## [377] 1032488 1027633 1151090 1220787 1040968  929942  919601 1192167
## [385] 1128993 1063482  977573  966050 1048472  944577  897127  898738
## [393] 1058741  990861  858364  976387 1173207  900683 1050774 1027971
## [401]  935199  877050 1166367  948342 1030359  921976 1127419 1204854
## [409] 1034092  927080 1198260 1142723 1099029  957621 1155578  890626
## [417] 1032477 1116107  995810 1099966 1048880  945525 1053022  997526
## [425]  977465 1173194 1167951  909248 1033614 1078459 1002118 1049543
## [433] 1133843 1026058 1105032  873019 1035968 1149519 1070208 1089453
## [441]  858972  979693  912341 1175222  934674  942184  951130 1040872
## [449] 1139308 1015624  936523 1042913 1237755 1138811 1012357 1068698
## [457]  910870  910395  920632  907826 1195494 1193526  934888 1104995
## [465] 1083653  906461  982609 1075677 1108589 1044047  918397 1030503
## [473] 1129110  976849  933684 1071984  972553 1132599  968628 1161127
## [481]  917385 1137949 1181090 1009580 1163161  928893 1056857  963475
## [489]  904644 1032668 1035436 1022216  944328 1106262 1018790  918706
## [497] 1033746 1147337 1159676 1026847  910144 1106190  916824 1118334
## [505]  935390 1086699 1139790  990978  975021 1153116 1098485 1171773
## [513]  900183 1042284 1135711  955723 1036715 1115376 1075423  859687
## [521] 1107254  935509  965062  973472  891204 1024450 1052645  870435
## [529]  948890  996569 1134224 1156730 1092244 1034392 1084783  815571
## [537]  945349 1182462  999405  864112 1047212  925440 1187090  960750
## [545]  985786  941212  786407  870200 1199292 1145814 1218234  893972
## [553]  959341 1037620 1030987 1140662 1060954  891559 1082764  964890
## [561] 1042325 1018607 1202259  933612  822680  936550 1120346  943121
## [569] 1176642  891086  864367 1040426  920831 1119401 1124318 1148010
## [577]  915816  909549 1036380  991912 1104523 1074688 1208246  883224
## [585] 1125268 1135094  928970 1198955 1085063 1061117  981714  895539
## [593] 1017064  968296 1055833  918138 1168637 1098360 1143717 1029688
## [601]  966565  997774 1058244  997194 1039311 1191338  955717 1229526
## [609]  922934  964598  974060  917956 1022689 1224227 1169331 1094929
## [617] 1031134  954024  975596  901736 1035967  920137 1202616 1111002
## [625]  917584  929915 1020649 1144412  989379 1147593  900306  903152
## [633] 1220004 1049727  880347  892078  948758 1036697  954385  971814
## [641] 1092833  871221 1033925 1210766 1080072 1107812 1047291 1013531
## [649] 1144803  932783  876871 1077016  896302 1029573 1113053 1226520
## [657] 1011660  946122 1055666 1201145 1171682 1189679  934107 1001424
## [665] 1209237 1108405  937425 1014533 1150451 1095389 1033033  932443
## [673]  901899 1070052 1112653  955359 1081388 1126153  904364 1209105
## [681]  991895 1026283  983809 1212311 1224533 1115681 1069263  896913
## [689]  993074 1131669  859848 1100742  932923 1129327 1036638  963489
## [697] 1127948  887683 1187075 1035715  956707  884082  964344 1138016
## [705]  951488  904382  964381  900607  927487 1016680  880017 1042995
## [713] 1031588 1197627 1240650 1120250  984017 1230959  962425 1148810
## [721] 1051243 1028176 1022898 1185586 1187684  884368  929848  958521
## [729] 1094025 1051319 1012256  979235  899894 1029460  879634 1009856
## [737]  879245  903826  983041  941374 1196181  989166 1212242 1036020
## [745] 1092866 1194914  996641  980406 1024431 1159061  949677  970092
## [753]  923110 1159834  924418  949840 1125929  929393 1091627 1021028
## [761] 1034667  976026 1056808 1140825 1004014  960183  925275  941303
## [769] 1090829 1000151 1088912  998446  930906 1193964  943141  852758
## [777] 1031589  967122 1080320  897290 1078902  927999  900792  931671
## [785] 1130881  973282  985579  905260 1142120  884862  960170 1116581
## [793]  959620  935518 1090758  994090 1051425 1009196  905371 1049386
## [801] 1065219 1128831 1048512  996931  827053  916391  978862 1184373
## [809] 1188953  969859  880829 1054075 1221869 1092778 1060784 1088357
## [817] 1091759  955489  942150 1197011  983181 1163997 1071926  933395
## [825] 1158474  904020 1002326
# So we find 827 entries which have duplicate records. So for my analysis I needed to reject these customers as they would bring additional details in my observation which may not be very useful. 

Removing unwanted records

Univariate Plots Section

After the data cleaning process, now I can proceed with the analysis. First I am checking single variables and their behavior. So first lets check whats the summary of loan taken and its statistic by exploring the price range in which maximum loan is taken.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6100    8293   12000   35000

We can see that there is an outlier in the plot after 30000.

## 'data.frame':    112239 obs. of  33 variables:
##  $ ListingNumber              : int  193129 1209647 81716 658116 909464 1074836 750899 768193 587746 213551 ...
##  $ ListingCreationDate        : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 54939 15374 ...
##  $ Term                       : int  36 36 36 36 36 60 36 36 60 36 ...
##  $ LoanStatus                 : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 3 ...
##  $ ClosedDate                 : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 538 ...
##  $ BorrowerRate               : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ProsperRating (numeric)    : int  NA 6 NA 6 3 5 2 4 4 NA ...
##  $ ProsperRating (Alpha)      : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 5 1 ...
##  $ ProsperScore               : num  NA 7 NA 9 4 10 2 4 7 NA ...
##  $ ListingCategory (numeric)  : int  0 2 0 16 2 1 1 2 1 0 ...
##  $ BorrowerState              : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 22 1 ...
##  $ EmploymentStatus           : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 3 ...
##  $ IsBorrowerHomeowner        : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 1 1 ...
##  $ CurrentlyInGroup           : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ CurrentCreditLines         : int  5 14 NA 5 19 21 10 6 2 2 ...
##  $ TotalCreditLinespast7years : int  12 29 3 29 49 49 20 10 56 10 ...
##  $ OpenRevolvingAccounts      : int  1 13 0 7 6 13 6 5 4 1 ...
##  $ OpenRevolvingMonthlyPayment: num  24 389 0 115 220 1410 214 101 25 40 ...
##  $ TotalInquiries             : num  3 5 1 1 9 2 0 16 2 5 ...
##  $ CurrentDelinquencies       : int  2 0 1 4 0 0 0 0 1 3 ...
##  $ AmountDelinquent           : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years    : int  4 0 0 14 0 0 0 0 28 1 ...
##  $ PublicRecordsLast10Years   : int  0 1 0 0 0 0 0 1 1 0 ...
##  $ RevolvingCreditBalance     : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization        : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.51 0.32 ...
##  $ TotalTrades                : num  11 29 NA 26 39 47 16 10 47 7 ...
##  $ DebtToIncomeRatio          : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.12 0.27 ...
##  $ IncomeRange                : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 6 2 ...
##  $ IncomeVerifiable           : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome        : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount         : int  9425 10000 3001 10000 15000 15000 3000 10000 13500 1000 ...
##  $ Investors                  : int  258 1 41 158 20 1 1 1 19 53 ...

## 
##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38048 
##                Current              Defaulted FinalPaymentInProgress 
##                  54928                   5018                    201 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    794                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    359                    309                    304

## 
##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7209          17046          31703          30463 
## $75,000-99,999  Not displayed   Not employed 
##          16650           7741            806

Now lets see the distribution of loan status across income. This will be interesting

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 79 rows containing non-finite values (stat_bin).

## 'data.frame':    112239 obs. of  33 variables:
##  $ ListingNumber              : int  193129 1209647 81716 658116 909464 1074836 750899 768193 587746 213551 ...
##  $ ListingCreationDate        : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 54939 15374 ...
##  $ Term                       : int  36 36 36 36 36 60 36 36 60 36 ...
##  $ LoanStatus                 : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 3 ...
##  $ ClosedDate                 : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 538 ...
##  $ BorrowerRate               : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ LenderYield                : num  0.138 0.082 0.24 0.0874 0.1985 ...
##  $ ProsperRating (numeric)    : int  NA 6 NA 6 3 5 2 4 4 NA ...
##  $ ProsperRating (Alpha)      : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 5 1 ...
##  $ ProsperScore               : num  NA 7 NA 9 4 10 2 4 7 NA ...
##  $ ListingCategory (numeric)  : int  0 2 0 16 2 1 1 2 1 0 ...
##  $ BorrowerState              : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 22 1 ...
##  $ EmploymentStatus           : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 3 ...
##  $ IsBorrowerHomeowner        : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 1 1 ...
##  $ CurrentlyInGroup           : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
##  $ CurrentCreditLines         : int  5 14 NA 5 19 21 10 6 2 2 ...
##  $ TotalCreditLinespast7years : int  12 29 3 29 49 49 20 10 56 10 ...
##  $ OpenRevolvingAccounts      : int  1 13 0 7 6 13 6 5 4 1 ...
##  $ OpenRevolvingMonthlyPayment: num  24 389 0 115 220 1410 214 101 25 40 ...
##  $ TotalInquiries             : num  3 5 1 1 9 2 0 16 2 5 ...
##  $ CurrentDelinquencies       : int  2 0 1 4 0 0 0 0 1 3 ...
##  $ AmountDelinquent           : num  472 0 NA 10056 0 ...
##  $ DelinquenciesLast7Years    : int  4 0 0 14 0 0 0 0 28 1 ...
##  $ PublicRecordsLast10Years   : int  0 1 0 0 0 0 0 1 1 0 ...
##  $ RevolvingCreditBalance     : num  0 3989 NA 1444 6193 ...
##  $ BankcardUtilization        : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.51 0.32 ...
##  $ TotalTrades                : num  11 29 NA 26 39 47 16 10 47 7 ...
##  $ DebtToIncomeRatio          : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.12 0.27 ...
##  $ IncomeRange                : Ord.factor w/ 8 levels "$0"<"$1-24,999"<..: 3 4 8 3 6 6 3 3 5 2 ...
##  $ IncomeVerifiable           : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome        : num  3083 6125 2083 2875 9583 ...
##  $ LoanOriginalAmount         : int  9425 10000 3001 10000 15000 15000 3000 10000 13500 1000 ...
##  $ Investors                  : int  258 1 41 158 20 1 1 1 19 53 ...

So now lets see the Employment status data.

## 
##                    Employed     Full-time Not available  Not employed 
##          2255         65912         26353          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3681          1088           795          5973
## 
##                    Employed     Full-time Not available  Not employed 
##          2255         65912         26353          5347           835 
##         Other     Part-time       Retired Self-employed 
##          3681          1088           795          5973

Next I change the Listing Category numeric data to factor variable with designated values.

0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans

##  Factor w/ 11 levels "Auto","Business",..: 6 4 6 1 4 3 3 4 3 6 ...

Debt burden

Debt to Income Ratio

A debt income ratio is the percentage of a consumer’s monthly gross income that goes toward paying debts. The data is capped at 10.01, debt-to-income ratio larger then 1000% will be returned as 1001%.

So lets take a look at the plot.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8393 rows containing non-finite values (stat_bin).

## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 797 rows containing non-finite values (stat_bin).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8393

Other information

## [1] "A"  "AA" "B"  "C"  "D"  "E"  "HR"

The most common rating is rating C, 17862. Only 5280 listing have AA rating or about 6.34% of the data. That’s very interesting so we get the idea that most loan takers have an average rating. No one pays on time then.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   42.44   60.00   60.00

Most loans have 36 months term.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00   13.59   18.97   19.66   25.79   36.00

The median for the borrower rate is 18.97% and mean 19.66%. The maximum borrower rate is 36.00%.

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

While there are many variables which i have chosen, the main features of the data are:

I chose these variables, because I think these are some imporatant factors while lending money.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think State abbreviation will play a part in knowing which state are vulnerable in Loan frauds. Also listing category will tell us about the loan category for which a person is borrowing money.

Did you create any new variables from existing variables in the dataset?

Yes, ListingCategoryNum.new and EmploymentStatus.char.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution i plotted mainly are positively skewed ie. DebtToIncomeRatio and LoanOriginalAmount . I did plot some graphs using faceting technique to get a hint of what’s going on in a factored variable. I changed Listing Category to a new varibale with simplified categories.


Bivariate Plots Section

So now lets move to two variable analysis to find some interesting patterns

LoanOriginalAmount vs Investors

First lets take a look at LoanOriginalAmount vs Investors such that we can get an idea upto how much an ivester is willing to lend their money and how many invester line up for that amount.

## Warning: Removed 827 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'gam'

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     1.0    30.0   110.1   202.0  1189.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    3.00   35.00   48.45   73.00  558.00

LoanOriginalAmount vs BorrowerRate

So next we analyze the LoanOriginalAmount vs BorrowerRate

## Warning: Removed 648 rows containing non-finite values (stat_smooth).
## Warning: Removed 1970 rows containing missing values (geom_point).

## [1] -0.413694

LoanOriginalAmount vs AnnualIncome

Now lets have a look at loan amount vs the income range. We will get to know which income range takes loan of upto 10000 dollars.

## Warning: Removed 406 rows containing non-finite values (stat_smooth).
## Warning: Removed 803 rows containing missing values (geom_point).

Most of the Loan are below 10000 and monthly income is under 10000. The quantile shows that the higher the monthly income the higher the median of the loan original amount.

The number of the data that have original amount < 10000 and annual income less then 100000 is:

## [1] 44656

which is around 53.7% of the data.

It seems that people who borrow > 25000 has monthly income of >= 10000 looks like there some kind of rule, that if you borrow > 25000 the the minimal monthly income should be at least 10000.

Let’s verify this a bit.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8333    9583   11667   13328   14583   66667

And yes the minimum monhtly income is 8333 for borrowing that amount of money. Let’s also check the correlation between the 2 variables.

## [1] 0.1813978

This relationship is quite weak, so we can’t be sure about any rules that for taking a loan your income should be a in a particular range.

LoanOriginalAmount vs ListingCategoryNum.new

Now lets have a look at listing category and amount of loan taken

## Warning: Removed 648 rows containing non-finite values (stat_boxplot).

So we see that the highest amount of loan takers come from the debt consolidation area. That will only increase their burden on the loan.

LoanOriginalAmount vs EmploymentStatus

Now we take a look at the employment status of the folks who take loans we have already seen this in Univariate analysis but lets have a more in depth analysis using box plots.

## Warning: Removed 648 rows containing non-finite values (stat_boxplot).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4000    5272    6750   25000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    1500    2500    3364    4000   15500

Interesting to see that not employed folks are requesting loan higher then part time. The median and the mean of not employed borrower are 4000 and 5272 vs 2500 and 3364 from part time borrower. Max loans for unemployed is 25000 and it is also greater than max loan for part-time borrower which is 15500

StatedMonthlyIncome vs BorrowerRate

Now lets take a look at the borrower rate and Monthly Income

## Warning: Removed 714 rows containing missing values (geom_point).

We can see there is really not much of a relation, only thing we notice is that mainly people who borrow loan are earn below 15000 dollars. And I can see the more the income less is your interest rate let us check if there is any negative correlation between them.

## [1] -0.09317254

Okay so there is a negative correlation of -0.093 but this is very small, so we can’t really say anything.

Let’s see the relationship of the Loan Amount with other variables.

So we can see that it is the region between (0-15000) dollars which is the most active and as we can tell BankcardUtilization, Delinquencies, PublicRecordsLast10Years and RevolvingCreditBalance all these variables usually a loan taker in the above range has a certain record related to these which would hamper their prosper rating.

So it seems we should be careful while lending money below 15000 dollars at the least.

ProsperRatingAlpha vs LoanOriginalAmount

Let’s look into prosper rating also.

## Warning: Removed 648 rows containing non-finite values (stat_boxplot).

Lets take a look at the summary by LoanOriginalAmount

## prosperLoans.sub$ProsperRatingAlpha: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5700   10500   11547   16000   35000 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5750   10000   11434   15000   35000 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    6000   10000   11580   15000   35000 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    5000   10000   10356   15000   25000 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6000    7065   10000   15000 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3500    4000    4564    5000   15900 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    3000    4000    3462    4000   16800

So the maximum loan taken is 35000 dollars and it has got the highest prosper rating too AA type. The median loan value of 4000 dollars lead to a prosper rating of ‘E’ and ‘HR’.

I don’t understand why people with such low loan amount cannot pay thier debts in time.

Lets check only ‘E’ and ‘HR’ type rating with the Listing category.

So we can see that these folks are generally students and it is as expected. However there are 2 more category which has a large amount of ‘E’ and “HR” prosper rating those are Auto and Other types.

Note: Auto means those who have taken loan for automobile of any kind.

ProsperRatingAlpha vs Term

Now lets look at prosper rating and Loan term

## [1] "12" "36" "60"

Take a look at the summary of each prosper rating with respect to Term

## prosperLoans.sub$ProsperRatingAlpha: AA
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   38.34   36.00   60.00 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: A
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   41.56   60.00   60.00 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   45.11   60.00   60.00 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: C
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   45.96   60.00   60.00 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   42.43   60.00   60.00 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   39.88   36.00   60.00 
## -------------------------------------------------------- 
## prosperLoans.sub$ProsperRatingAlpha: HR
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      36      36      36      36      36      36

So mostly people take a term of 36 months or 3 years.

Now let’s look at the 3 categories we found earlier in Lending category and see their term, borrower rate,TotalInquiries, DelinquenciesLast7Years, PublicRecordsLast10Years

SO we can clearly see people who take loans for Auto purposes are the most risky ones.

Now let’s look at the correlation of Prosper rating with some variables and find out which factors are more important.

Correlation between BorrowerRate with Prosper Rating:

## [1] -0.9531032

Correlation between Prosper rating with RevolvingCreditBalance:

## [1] 0.06221151

Correlation between Prosper rating with BankcardUtilization:

## [1] -0.2658791

Correlation between Prosper rating with Term:

## [1] 0.08213252

Correlation between Prosper rating with OpenRevolvingAccounts:

## [1] 0.1241319

Correlation between Prosper rating with DebtToIncomeRatio:

## [1] -0.1353105

Correlation between Prosper rating with LoanOriginalAmount:

## [1] 0.4298129

So prosper rating is higly dependent on Borrower rate. They have a negative correlation of -0.9531032.

2nd is Bank card utilization with a negative correlation of -0.2658791.

3rd is Open revolving accounts with a positive correlation of 0.1241319.

Finally Prosper rating is positively correlated with loan amount having pearson’s R of 0.4298129

So Borrower rate and loan amount is the thing to watch here along with the other 2 parameters.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I wanted to see how several features affect the LoanOriginalAmount. So i took help of Investors, BorrowerRate, StatedMonthlyIncome, ListingCategoryNum.new, EmploymentStatus and ProsperRatingAlpha.

For almost every case i used 99 % of the data removin any outliers from my investigation.

I found that the maximum loan taken is 35000 dollars and it has got the highest prosper rating too AA type. The highest amount of loan takers come from the debt consolidation area. I observed that students, Auto and Other types listing category fill up the ‘E’ and “HR” prosper rating.

SO we can clearly see people who take loans for Auto purposes are the most risky ones.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

No.

What was the strongest relationship you found?

So i found that mainly 3 types of listing category ie. Auto, Other and Students are the most vulnerable to late payments or being a defaulter. On the basis of this I plotted some boxplot for these 3 categories and other variables such as Term, BorrowerRate, DelinquenciesLast7Years, PublicRecordsLast10Years and TotalInquiries so that we can find out some data having some bad records to their name. We find that mainly there are records for Auto and Other categories in these plots. So for these 2 listing category the money lenders should take precaution and check the background of the person to whom they are lending money.

I also plotted ProsperRatingAlpha vs Term but for Term it is not affected by the rating, the most common terms is 36 across the ratings.

For LoanOriginalAmount vs BorrowerRate We find out that as loan amount increases the rate of interest decreases having pearson’s R as -0.413694.

Finally i found out correlation between ProsperRatingNum and various variables and found out prosper rating is higly dependent on Borrower rate with a negative correlation of -0.9531032 and is positively correlated with loan amount having pearson’s R of 0.4298129.

So these are some of the strongest relationships i have found.


Multivariate Plots Section

So we see that prosper rating has a positive correlation for loan amount and Open revolving accounts. So let us first plot these variables with Monthly income

I am transforming Monthly income to yearly income to see the trend

options(scipen = 999)

ggplot(aes(x = StatedMonthlyIncome * 12, y = LoanOriginalAmount), 
       data = subset(prosperLoans.sub, 
                     StatedMonthlyIncome*12 < 
                       quantile(StatedMonthlyIncome*12, 0.99, na.rm = TRUE))) +
  geom_point(aes(color=ProsperRatingAlpha),  alpha = 0.5, 
             size = 1, position = 'jitter') +
  scale_x_log10() + 
  scale_color_brewer(type="div", palette = 3) +
  xlab("Yearly Income")
## Warning: Transformation introduced infinite values in continuous x-axis

We see that those who took a loan amount above 25000 dollars are in AA rating and as the loan amount increases the rating increases. This proves our correlation. This was wierd at first but then i recalled the plot i made earlier from which i got that mostly people with low prosper rating are students, Auto and Other types.

Now next we see which ListingCategory has most of the loans which correspond to their low prosper rating.

ggplot(aes(x = StatedMonthlyIncome * 12, y = LoanOriginalAmount), 
       data = subset(prosperLoans.sub, 
                     (StatedMonthlyIncome*12 < 
                       quantile(StatedMonthlyIncome*12, 0.99, na.rm = TRUE)) &
                       ListingCategoryNum.new == "Auto" |
                       ListingCategoryNum.new == "Student" |
                       ListingCategoryNum.new == "Other/NA")) +
  geom_point(aes(color=ListingCategoryNum.new), alpha = 0.5, size = 1, position = 'jitter') +
  scale_x_log10() + 
  scale_color_brewer(type="fill", palette = "PuBu")+
  xlab("Yearly Income")
## Warning: Transformation introduced infinite values in continuous x-axis

So we see that most of the loans below 25000 dollars comprise of Other or Auto Listing Category.

Alright let’s see the plot of OPen revolving accounts vs TotalCreditLinespast7years

ggplot(aes(x = OpenRevolvingAccounts, y = TotalCreditLinespast7years), 
       data = prosperLoans.sub) +
  geom_point(aes(color=ProsperRatingAlpha),  alpha = 0.5, 
             size = 1, position = 'jitter') +
  scale_x_continuous() + 
  scale_color_brewer(type="div", palette = 3)

So I wanted to see how many credit profiles a person had and their open revolving accounts. It seemed to me that they both are positively correlated and isn’t it obvious more the accounts more credit lines a person has.

However I could not make anything out of the prosper rating, it looked very mixed up.

Now let’s look at loan amount vs AmountDelinquent

ggplot(aes(x = LoanOriginalAmount, y = AmountDelinquent), 
       data = subset(prosperLoans.sub, 
                     (AmountDelinquent < 
                       quantile(AmountDelinquent, 0.99, na.rm = TRUE)))) +
  geom_point(aes(color=ProsperRatingAlpha),  alpha = 0.5, 
             size = 1, position = 'jitter') +
  scale_x_continuous() + 
  scale_color_brewer(type="div", palette = 3)

As expected we see that people with less amount Delinquent fall into good Prosper rating. Most of the people who has large amount delinquent have low propser rating, however thier loans are not that much but still they don’t pay in time. That is some strange characteristic.

Now we check the PublicRecordsLast10Years vs Loan amount

ggplot(aes(x = LoanOriginalAmount, y = PublicRecordsLast10Years), 
       data = subset(prosperLoans.sub, 
                     (PublicRecordsLast10Years < 10))) +
  geom_point(aes(color=ProsperRatingAlpha),  alpha = 0.5, 
             size = 1, position = 'jitter') +
  scale_x_continuous() + 
  scale_color_brewer(type="div", palette = 3)

Here we notice that more the public record less is the prosper rating. But again people whose loan are less than 10000 dollars have more public records.

Now let’s see how many of them are students or in other listing category

ggplot(aes(x = LoanOriginalAmount, y = PublicRecordsLast10Years), 
       data = subset(prosperLoans.sub, 
                     (PublicRecordsLast10Years < 10) & 
                       (ListingCategoryNum.new == "Auto" |
                       ListingCategoryNum.new == "Student" |
                       ListingCategoryNum.new == "Other/NA"))) +
  geom_point(aes(color=ListingCategoryNum.new), size = 1, 
             position = 'jitter') +
  scale_x_continuous() + 
  scale_color_brewer(type="div", palette = 'RdBu' )  

So from each plot i can see that those who take loan for Auto and Other have past records attached to them. Finally let’s look at the income range for these listing categories

ggplot(aes(x = IncomeRange, y = LoanOriginalAmount, fill = ListingCategoryNum.new), 
       data = subset(prosperLoans.sub, 
                       ListingCategoryNum.new == "Auto" |
                       ListingCategoryNum.new == "Student" |
                       ListingCategoryNum.new == "Other/NA"))  +
  geom_boxplot()

Finally lets take out the summary of yearly income wrt to these particular listing category. From this we get an idea of how much these folks earn. By looking at the above plot it seems the Other category earns much more than the other two. I can’t understand what’s their problem in settling their loan amount in time. But in future if you want to lend someone money, one should check if they are particularly from these background and proceed with caution.

Finally getting back to my final analysis and after that I will model the data for prdicting Prosper rating

First let me make a new data frame which contains summarized values for these 3 listing categories just for reference.

listing_summary <- subset(prosperLoans.sub, 
         ListingCategoryNum.new == "Auto" |
         ListingCategoryNum.new == "Student" |
         ListingCategoryNum.new == "Other/NA") %>%
  group_by(ListingCategoryNum.new) %>% 
  summarise (mean_Income = round(mean(StatedMonthlyIncome*12), digits = 0),
             median_Income = round(median(StatedMonthlyIncome*12), digits = 0),
             mean_loan = round(mean(LoanOriginalAmount), digits = 0),
             max_loan = round(max(LoanOriginalAmount), digits = 0),
             min_loan = round(min(LoanOriginalAmount), digits = 0),
             number_of_people = n())
#round(x, digits = 0)

View(listing_summary)

Obviously we can see that most of the people approximately 10339 fall in Other/NA listing category and they are the ones who has highest earnings among the other 2 categories but still they have a bad name and do not pay loan in time or get defaulted.

Now, let’s do our model for predicting ProsperRating. I am going to add in the individual features for this.

library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'memisc'
## The following objects are masked from 'package:dplyr':
## 
##     collect, recode, rename
## The following object is masked from 'package:plyr':
## 
##     rename
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
m1 <- lm(ProsperRatingNum ~ LoanOriginalAmount, data = prosperLoans.sub)
m2 <- update(m1, ~ . + DebtToIncomeRatio)
m3 <- update(m2, ~ . + OpenRevolvingAccounts)
m4 <- update(m3, ~ . + TotalTrades)
m5 <- update(m4, ~ . + StatedMonthlyIncome)
m6 <- update(m5, ~ . + TotalInquiries)
m7 <- update(m6, ~ . + BorrowerRate)
m8 <- update(m7, ~ . + BankcardUtilization)

mtable(m1, m2, m3, m4, m5, m6, m7, m8, sdigits = 3)
## 
## Calls:
## m1: lm(formula = ProsperRatingNum ~ LoanOriginalAmount, data = prosperLoans.sub)
## m2: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio, 
##     data = prosperLoans.sub)
## m3: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio + 
##     OpenRevolvingAccounts, data = prosperLoans.sub)
## m4: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio + 
##     OpenRevolvingAccounts + TotalTrades, data = prosperLoans.sub)
## m5: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio + 
##     OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome, 
##     data = prosperLoans.sub)
## m6: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio + 
##     OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome + 
##     TotalInquiries, data = prosperLoans.sub)
## m7: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio + 
##     OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome + 
##     TotalInquiries + BorrowerRate, data = prosperLoans.sub)
## m8: lm(formula = ProsperRatingNum ~ LoanOriginalAmount + DebtToIncomeRatio + 
##     OpenRevolvingAccounts + TotalTrades + StatedMonthlyIncome + 
##     TotalInquiries + BorrowerRate + BankcardUtilization, data = prosperLoans.sub)
## 
## =========================================================================================================================
##                             m1          m2          m3          m4          m5          m6          m7          m8       
## -------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            3.026***    3.291***    3.199***    3.279***    3.259***    3.514***     8.144***    8.207***   
##                         (0.009)     (0.011)     (0.013)     (0.014)     (0.015)     (0.015)      (0.008)     (0.008)     
##   LoanOriginalAmount     0.000***    0.000***    0.000***    0.000***    0.000***    0.000***     0.000***    0.000***   
##                         (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     (0.000)      (0.000)     (0.000)     
##   DebtToIncomeRatio                 -0.662***   -0.691***   -0.688***   -0.670***   -0.704***    -0.081***   -0.077***   
##                                     (0.017)     (0.017)     (0.017)     (0.017)     (0.017)      (0.006)     (0.006)     
##   OpenRevolvingAccounts                          0.016***    0.026***    0.026***    0.027***    -0.006***   -0.005***   
##                                                 (0.001)     (0.001)     (0.001)     (0.001)      (0.000)     (0.000)     
##   TotalTrades                                               -0.006***   -0.007***   -0.001        0.001***    0.002***   
##                                                             (0.001)     (0.001)     (0.001)      (0.000)     (0.000)     
##   StatedMonthlyIncome                                                    0.000***    0.000***     0.000***    0.000***   
##                                                                         (0.000)     (0.000)      (0.000)     (0.000)     
##   TotalInquiries                                                                    -0.088***    -0.011***   -0.014***   
##                                                                                     (0.001)      (0.000)     (0.000)     
##   BorrowerRate                                                                                  -20.864***  -20.519***   
##                                                                                                  (0.027)     (0.028)     
##   BankcardUtilization                                                                                        -0.262***   
##                                                                                                              (0.006)     
## -------------------------------------------------------------------------------------------------------------------------
##   R-squared                   0.185       0.194       0.195       0.197       0.197       0.236      0.912       0.914   
##   adj. R-squared              0.185       0.193       0.195       0.197       0.197       0.236      0.912       0.914   
##   sigma                       1.515       1.493       1.492       1.490       1.490       1.453      0.494       0.489   
##   F                       18842.573    9119.598    6152.818    4653.560    3733.543    3920.892 112002.454  100520.274   
##   p                           0.000       0.000       0.000       0.000       0.000       0.000      0.000       0.000   
##   Log-likelihood        -152519.030 -138351.932 -138263.518 -138200.794 -138179.222 -136277.381 -54311.232  -53430.710   
##   Deviance               190782.908  169525.245  169131.377  168852.504  168756.700  160520.678  18578.153   18152.727   
##   AIC                    305044.061  276711.865  276537.037  276413.588  276372.443  272570.762 108640.464  106881.420   
##   BIC                    305072.046  276748.820  276583.231  276469.020  276437.115  272644.672 108723.613  106973.808   
##   N                       83155       76020       76020       76020       76020       76020      76020       76020       
## =========================================================================================================================

The overall R-squared value is 0.914, which is very strong;Wow I am surprised that my model will be predicting the ProsperRating with this much accuracy. And as it looks BorrowerRate might be the most useful varable in our list for predicting Prosper Rating.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the  investigation. Were there features that strengthened each other in terms of  looking at your feature(s) of interest?

Loans with higher ProsperRating have higher Loan amount borrowed and vice versa. Yearly Incmone is not a factor for Prosper Rating.

I focussed my investigation on the 3 listing categories I found from Bivariate analysis that have low prosper rating. And i found that most loans were taken by Other and Auto listing category and some going even higher than 20000 dollars.

Next i got that folks with low Amount Delinquent have higher Prosper Rating and vice versa, same is for number of Public Records. Most of the people who has large amount delinquent or more public records have low propser rating, however thier loans are not that much big amount but still they don’t pay in time.

So finally i see that those who take loan for Automobile and Other purposes which are Large Purchase or some green loans or cosmetic surgery are very vulnerable to untimely payments and have past records attached to them.

Were there any interesting or surprising interactions between features?

Yes there was in one of my final charts while investigating the 3 listing categories i found that those who take loans with Other Listing Category earns much more than the other two. I can’t understand their problem in settling their loan amount in time. So in future the bank or lender should check their background in these categories especially Auto and Other and borrow money.

OPTIONAL: Did you create any models with your dataset? Discuss the  strengths and limitations of your model.

Yes, I created a linear model with dependent variable as the ProsperRating and the predictors as the Loan Original Amount, DebtToIncomeRatio, OpenRevolvingAccounts, TotalTrades, StatedMonthlyIncome, TotalInquiries, BorrowerRate and BankcardUtilization.

The variables in the linear model account for 91.4% of the variance in the Prosper Rating. This means that there is a 91.4% chance that the model will hold true for predicting the Prosper Rating and that’s quite a model i created unknowingly. Mostly if you use this model you would get an accurate idea of Prosper Rating.

So I guess, the model’s limitation would be the huge amount of variables that I did not use for predicting ProsperRating. I only choose these 8 variables as per my above analysis. I don’t know how adding the other variables would effect my model.

Final Plots and Summary

So for my final plots I choose to see the data over a chloropleth. All this time I did my analyis build the model for predicting Prosper Rating. Now let’s use the Borrower state data to look at other insights from a birds eye view.

Okay now lets take a look at statewise distribution of mean loan amount.

So for that we need to use maps package and merge the state data with our data.

#x <- c("NY", "MA")
#state.name[match(x,state.abb)]

#Creating a new data frame with state data imported from maps package

#install.packages('maps')

library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
## 
##     ozone
library(ggmap)
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
states_map <- map_data("state")


#states_map

#Matching state data of peosper loan data set to state.abb(state abbreviations) and getting the state names


prosperLoans.sub$State <- state.name[match(prosperLoans.sub$BorrowerState, state.abb)]

#prosperLoans.sub$State




#Converting the state names to lower case for the join i need to apply

prosperLoans.sub$State <- sapply(prosperLoans.sub$State, tolower)

#prosperLoans.sub$State


#Summarizing the data and finding out mean loans for each state
df_state <- subset(prosperLoans.sub, !is.na(DebtToIncomeRatio)) %>% group_by(State) %>% 
  summarize(mean_loan = mean(LoanOriginalAmount),
            Total_Inquiries = sum(TotalInquiries),
            Total_Records = sum(PublicRecordsLast10Years),
            no_of_investors = sum(Investors))




by(prosperLoans.sub$DebtToIncomeRatio, prosperLoans.sub$State, summary)
## prosperLoans.sub$State: alabama
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2500  0.2838  0.3400  7.2700      78 
## -------------------------------------------------------- 
## prosperLoans.sub$State: alaska
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.1600  0.2400  0.2547  0.3425  0.6700      17 
## -------------------------------------------------------- 
## prosperLoans.sub$State: arizona
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1450  0.2100  0.2354  0.3000  1.2800      88 
## -------------------------------------------------------- 
## prosperLoans.sub$State: arkansas
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1800  0.2700  0.2995  0.3600  7.2700      65 
## -------------------------------------------------------- 
## prosperLoans.sub$State: california
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1300  0.2000  0.2268  0.2800 10.0100     995 
## -------------------------------------------------------- 
## prosperLoans.sub$State: colorado
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1600  0.2300  0.2713  0.3100 10.0100     165 
## -------------------------------------------------------- 
## prosperLoans.sub$State: connecticut
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1400  0.2100  0.2577  0.3000 10.0100     103 
## -------------------------------------------------------- 
## prosperLoans.sub$State: delaware
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2300  0.2917  0.3200 10.0100      10 
## -------------------------------------------------------- 
## prosperLoans.sub$State: florida
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1600  0.2300  0.2666  0.3200 10.0100     557 
## -------------------------------------------------------- 
## prosperLoans.sub$State: georgia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1600  0.2300  0.2701  0.3300 10.0100     290 
## -------------------------------------------------------- 
## prosperLoans.sub$State: hawaii
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1600  0.2300  0.2619  0.3350  2.2800      31 
## -------------------------------------------------------- 
## prosperLoans.sub$State: idaho
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2600  0.2998  0.3500  6.4900      41 
## -------------------------------------------------------- 
## prosperLoans.sub$State: illinois
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1500  0.2200  0.2482  0.3100 10.0100     318 
## -------------------------------------------------------- 
## prosperLoans.sub$State: indiana
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1600  0.2400  0.2754  0.3200 10.0100     114 
## -------------------------------------------------------- 
## prosperLoans.sub$State: kansas
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1800  0.2500  0.2762  0.3500  1.6000      62 
## -------------------------------------------------------- 
## prosperLoans.sub$State: kentucky
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1800  0.2600  0.3037  0.3500  8.5200      63 
## -------------------------------------------------------- 
## prosperLoans.sub$State: louisiana
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2650  0.2933  0.3900  1.1500      61 
## -------------------------------------------------------- 
## prosperLoans.sub$State: maryland
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1400  0.2100  0.2313  0.2900  2.4500     146 
## -------------------------------------------------------- 
## prosperLoans.sub$State: massachusetts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1400  0.2000  0.2392  0.2900 10.0100     111 
## -------------------------------------------------------- 
## prosperLoans.sub$State: michigan
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1600  0.2300  0.2616  0.3300  5.3800     205 
## -------------------------------------------------------- 
## prosperLoans.sub$State: minnesota
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1600  0.2250  0.2532  0.3200 10.0100     148 
## -------------------------------------------------------- 
## prosperLoans.sub$State: mississippi
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.1800  0.2600  0.2877  0.3700  1.1500      57 
## -------------------------------------------------------- 
## prosperLoans.sub$State: missouri
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2400  0.2683  0.3300  6.0700     166 
## -------------------------------------------------------- 
## prosperLoans.sub$State: montana
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1700  0.2700  0.3111  0.3700  5.9800      25 
## -------------------------------------------------------- 
## prosperLoans.sub$State: nebraska
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0400  0.1900  0.2700  0.2876  0.3600  1.1900      47 
## -------------------------------------------------------- 
## prosperLoans.sub$State: nevada
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.1500  0.2200  0.2516  0.3100 10.0100      76 
## -------------------------------------------------------- 
## prosperLoans.sub$State: new hampshire
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1500  0.2300  0.2571  0.3200  4.4300      28 
## -------------------------------------------------------- 
## prosperLoans.sub$State: new jersey
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.010   0.140   0.210   0.237   0.290  10.010     213 
## -------------------------------------------------------- 
## prosperLoans.sub$State: new mexico
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.1700  0.2600  0.2707  0.3400  1.0800      35 
## -------------------------------------------------------- 
## prosperLoans.sub$State: new york
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1300  0.2000  0.2394  0.2900 10.0100     543 
## -------------------------------------------------------- 
## prosperLoans.sub$State: north carolina
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1600  0.2300  0.2681  0.3250 10.0100     202 
## -------------------------------------------------------- 
## prosperLoans.sub$State: ohio
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2500  0.2837  0.3500 10.0100     253 
## -------------------------------------------------------- 
## prosperLoans.sub$State: oklahoma
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1700  0.2500  0.2764  0.3500  2.5300      55 
## -------------------------------------------------------- 
## prosperLoans.sub$State: oregon
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.160   0.230   0.258   0.320  10.010     140 
## -------------------------------------------------------- 
## prosperLoans.sub$State: pennsylvania
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1600  0.2400  0.2651  0.3200  6.8500     213 
## -------------------------------------------------------- 
## prosperLoans.sub$State: rhode island
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0400  0.1400  0.2100  0.2481  0.2900  6.6800      27 
## -------------------------------------------------------- 
## prosperLoans.sub$State: south carolina
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.170   0.250   0.299   0.350  10.010     100 
## -------------------------------------------------------- 
## prosperLoans.sub$State: south dakota
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.050   0.210   0.280   0.293   0.360   0.600      23 
## -------------------------------------------------------- 
## prosperLoans.sub$State: tennessee
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1700  0.2500  0.2913  0.3500 10.0100     121 
## -------------------------------------------------------- 
## prosperLoans.sub$State: texas
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1700  0.2400  0.2687  0.3400  4.6800     460 
## -------------------------------------------------------- 
## prosperLoans.sub$State: utah
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.020   0.160   0.230   0.276   0.310  10.010      68 
## -------------------------------------------------------- 
## prosperLoans.sub$State: vermont
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.1700  0.2700  0.2924  0.3800  1.2100      14 
## -------------------------------------------------------- 
## prosperLoans.sub$State: virginia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0100  0.1500  0.2200  0.2545  0.3200 10.0100     210 
## -------------------------------------------------------- 
## prosperLoans.sub$State: washington
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1500  0.2200  0.2566  0.3100 10.0100     200 
## -------------------------------------------------------- 
## prosperLoans.sub$State: west virginia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.1800  0.2600  0.2873  0.3600  4.1600      27 
## -------------------------------------------------------- 
## prosperLoans.sub$State: wisconsin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0200  0.1600  0.2300  0.2788  0.3300 10.0100     135 
## -------------------------------------------------------- 
## prosperLoans.sub$State: wyoming
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0300  0.2400  0.3000  0.3237  0.3800  1.3900      14
str(df_state)
## Classes 'tbl_df', 'tbl' and 'data.frame':    48 obs. of  5 variables:
##  $ State          : chr  "alabama" "alaska" "arizona" "arkansas" ...
##  $ mean_loan      : num  8660 10354 8886 8098 9847 ...
##  $ Total_Inquiries: num  3953 711 7131 2331 53577 ...
##  $ Total_Records  : int  370 26 374 251 2457 507 403 75 1173 945 ...
##  $ no_of_investors: int  67172 10498 77420 36916 719889 104363 97664 18222 348568 235391 ...
#Renaming the columns
#df_state <- rename(df_state, state = BorrowerState_new)


#making a new data frame which contains the centre coordinate of each state and its abbreviation
states <- data.frame(state.center, state.abb,state.name)
library(ggthemes)

# creating new data frame by merging states_map & df_state
new_map <- merge(states_map, df_state, by.x = "region", by.y = "State")


new_map <- arrange(new_map, group, order) # to sort polygons in right order

Plot One

## Warning: Ignoring unknown parameters: linestyle

Description One

This chloropleth shows the statewise distribution of mean loan amount. So from the map we see that folks living in California, Texas, New Mexico, Massachusetts, New Jersey and Virginia take huge amount of loans as compared to other states.

Plot Two

Now let’s see the number of investors of each state along with loan amount.

#First we need to make a new dataframe which contains the summarized value and the centers of each state


states$state.name <- sapply(states$state.name, tolower)

summarized_map <- merge(states, df_state, by.x = "state.name", by.y = "State")


new_map <- arrange(new_map, group, order) # to sort polygons in right order
## Warning: Ignoring unknown parameters: linestyle

Description Two

We can see that California, Texas, New York, Illinois, Florida, Georgia, Ohio and Virginia are some main hubs for investors. Maybe these places are safe for lending your money or giving loans. These are the main states to look for while lending money sine there are so many investors in these places

Plot Three

Now lets look at the total number of Inquiries and Records for the States.

## Warning: Ignoring unknown parameters: linestyle

Description Three

So in the last plot we are able to see which states have a bad reputation in settling loans. So we can see that California, Texas, Florida, Illinois, Virginia, New York, Ohia and Michigan have some records associated with Inquiries. So this suggests that people living in these states have been inquired previously for unsettling their loans or have a bad background.


Reflection

The Prosper data set originally contains 113,937 records with 81 features and I was able to trim it down to 83,155 records; removing records with duplicate ListingKeys and those with NA and empty ProsperRating.

There are a lot of challenges which I faced with this project. One is not being familiar with Prosper, so going through their website and reading information on Prosper listing certainly helped, especially in understanding the different features of the Prosper loan.

Another challenge was choosing which feature/variable was the most interesting. The problem with Prosper is that it has many interesting features, so it is hard to choose which one is the most interesting, like ProsperScore, Borrower’s Rate, Loan Original Amount etc. Among all these Prosper Rating was the most important feature since it is a measurement of a Loan’s risk. So this made me interested on seeing the listing category which would have low Prosper Rating since they would be the most vulnerable in settling their loans.

The most difficult task I find in this study was choosing the 20 - 30 features out of 81. So I mainly choose the variable as per my instinct and which I was familiar with. So, I read about the features from the Prosper website and finally, I came up with 33 features. But after all my analysis I saw that mainly I used 20 - 22 features for my analysis.

While starting this project I had one thing in mind to find out type of people who make up risky loans. So i carried out my analysis accordingly. First I updated the listing category to fewer categories for my ease of use. Since all were numbered I renamed them to their original meaning so that i could understand what i am dealing with. And I made 11 categories out of it. And after Bivariate Analysis I got the idea that people with Listing Category Student, Other/NA and Auto are very risky. It was after Multivariate analysis that I concluded that loan for Automobile and Other purposes which include Large Purchase, green loans and cosmetic surgery are very vulnerable to untimely payments and have past records attached to them.

One thing i am happy about is my Linear Model Which I made since it has an acuuracy of 91.4%. I don’t know but by hit and trial i got the best variables for the model.

One thing i want to look at is a more improved way to look at the data over a map/chloropleth. In my final plots section I created map using State name abbreviation provided in the data. But there are some techniques in this which are unknown to me. In my third plot the legends are overlapping and I am yet to find a solution to that. This is a very tricky thing but i hope to find some solution to this problem. Coming back to data there are many unused variables. We can use other variables also, In the data there is information on borrower credit score. So it will be interesting to see what kind of relationship exists between credit score and ProsperRating.


References

[1] https://www.prosper.com/help/ [2] https://www.prosper.com/help/topics/how-to-read-a-loan-listing/ [3] www.google.com [4] https://s3.amazonaws.com/udacity-hosted-downloads/ud651/GeographyOfAmericanMusic.html [5] http://www.statmethods.net/ [6] http://www.cookbook-r.com/Manipulating_data/ [7] https://stackoverflow.com/questions/17723822/administrative-regions-map-of-a-country-with-ggmap-and-ggplot2